Student’s t-test (one-sample, paired, two-sample)#
A t-test answers a very specific question:
Is a mean (or mean difference) far enough from a reference value that it would be surprising if the true mean were actually that reference value?
It’s the default tool for mean comparisons when:
you have numeric data (measurements, scores, times, amounts)
the population standard deviation \(\sigma\) is unknown (almost always)
sample sizes are small to medium, so estimating uncertainty matters
Learning goals#
By the end you should be able to:
pick the right t-test variant (one-sample vs paired vs two-sample)
compute the t statistic and degrees of freedom by hand
understand and interpret p-values and confidence intervals
implement the test end-to-end using NumPy only (no SciPy)
use interactive Plotly visuals to build intuition
import math
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) When do you use a t-test?#
Use a t-test when you want to compare means:
One-sample: “Is the mean different from a baseline?” (e.g., average delivery time vs SLA)
Paired: “Did the mean change within the same units?” (e.g., before/after intervention on the same users)
Two-sample: “Are the group means different?” (e.g., A/B test: variant B vs A)
A t-test is not a general “difference detector”. It’s not ideal when:
the data are extremely skewed or dominated by outliers
you care about medians or quantiles, not means
observations are dependent (e.g., time series without adjustments)
If those issues apply, consider robust or nonparametric alternatives (permutation tests, Wilcoxon/Mann–Whitney, trimmed-mean tests).
2) The three t-tests (same idea, different setup)#
All t-tests have the same structure:
define a null hypothesis \(H_0\) about a mean (or mean difference)
compute a t statistic = (estimate − null value) / (estimated standard error)
use the Student t distribution to measure how “extreme” that statistic is
Quick reference#
| Test | What is the parameter? | Typical \(H_0\) |
|---|---|---|
| One-sample | mean \(\mu\) | \(\mu = \mu_0\) |
| Paired | mean difference \(\mu_d\) | \(\mu_d = 0\) |
| Two-sample (Welch) | mean difference \(\mu_1 - \mu_2\) | \(\mu_1 - \mu_2 = 0\) |
Rule of thumb: for two independent groups, prefer Welch’s t-test (it does not assume equal variances).
3) Hypothesis testing vocabulary (and what it actually means)#
\(H_0\) (null): the “no effect / no difference” statement you test.
\(H_1\) (alternative): what you’ll believe if the data are inconsistent with \(H_0\).
\(\alpha\) (significance level): a chosen false-positive rate (often 0.05).
p-value: under \(H_0\), the probability of observing a test statistic at least as extreme as what you got.
Important interpretation points:
A p-value is not “the probability \(H_0\) is true”.
A small p-value means the data are unlikely under \(H_0\).
A large p-value does not prove “no effect”; it may mean “not enough evidence”.
A good habit: always pair the p-value with a confidence interval (CI) and an effect size.
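One common effect size for a one-sample comparison is Cohen's d: the mean shift measured in sample-standard-deviation units. Here is a minimal sketch (the helper cohens_d_one_sample is illustrative and not used elsewhere in this notebook; the toy numbers reappear in the worked example in section 5):
# Cohen's d (one-sample): standardized mean difference (illustrative helper).
def cohens_d_one_sample(x: np.ndarray, mu0: float) -> float:
    x = np.asarray(x, dtype=float)
    return float((x.mean() - mu0) / x.std(ddof=1))

# Made-up numbers: the sample mean sits about one sample SD above mu0 = 10.
cohens_d_one_sample(np.array([10.2, 11.1, 9.8, 10.9, 10.6]), 10.0)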
4) Why the t distribution appears#
If the population standard deviation \(\sigma\) were known, the standardized mean
\[ z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}} \]
would follow a standard normal distribution under \(H_0\) (for normal data).
In reality, \(\sigma\) is unknown, so we estimate it with the sample standard deviation \(s\). Replacing \(\sigma\) with \(s\) injects extra uncertainty, and the distribution becomes heavier-tailed.
That heavier-tailed distribution is Student’s t, controlled by the degrees of freedom (df).
small df → heavier tails (more uncertainty)
large df → approaches Normal(0, 1)
# Student t density (no SciPy)
def t_pdf(t: np.ndarray, df: float) -> np.ndarray:
t = np.asarray(t, dtype=float)
df = float(df)
# log form for numerical stability
log_coeff = (
math.lgamma((df + 1) / 2)
- 0.5 * math.log(df * math.pi)
- math.lgamma(df / 2)
)
return np.exp(log_coeff) * (1 + (t**2) / df) ** (-(df + 1) / 2)
x = np.linspace(-6, 6, 2000)
fig = go.Figure()
for df in [1, 2, 5, 10, 30, 100]:
fig.add_trace(go.Scatter(x=x, y=t_pdf(x, df), mode="lines", name=f"t(df={df})"))
normal_pdf = (1 / np.sqrt(2 * np.pi)) * np.exp(-(x**2) / 2)
fig.add_trace(
go.Scatter(
x=x,
y=normal_pdf,
mode="lines",
name="Normal(0,1)",
line=dict(color="black", dash="dash"),
)
)
fig.update_layout(
title="Student's t distribution: heavier tails for small df",
xaxis_title="t",
yaxis_title="density",
)
fig.show()
5) The one-sample t-test (the core idea)#
Given observations \(x_1, \dots, x_n\):
sample mean: \(\bar{x}\)
sample standard deviation:
\[ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]
The standard error (estimated std of the sample mean) is:
\[ \mathrm{SE} = \frac{s}{\sqrt{n}} \]
The t statistic for testing \(H_0: \mu = \mu_0\) is:
\[ t = \frac{\bar{x} - \mu_0}{\mathrm{SE}} = \frac{\bar{x} - \mu_0}{s / \sqrt{n}} \]
Under \(H_0\) (and normal data), \(t\) follows a Student t distribution with \(n - 1\) degrees of freedom: \(t \sim t_{n-1}\).
Interpretation: \(t\) is “how many estimated standard errors away from \(\mu_0\) the sample mean is”.
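As a tiny worked example with made-up numbers (x_demo is just an illustration, not part of the later analysis):
# Worked example by hand: test H0: mu = 10 on five made-up observations.
x_demo = np.array([10.2, 11.1, 9.8, 10.9, 10.6])
n_demo = x_demo.size                    # 5, so df = 4
mean_demo = x_demo.mean()               # sample mean
s_demo = x_demo.std(ddof=1)             # sample std (ddof=1 -> divide by n - 1)
se_demo = s_demo / np.sqrt(n_demo)      # standard error of the mean
t_demo = (mean_demo - 10.0) / se_demo   # t statistic
n_demo - 1, mean_demo, s_demo, se_demo, t_demo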
6) Implementation from scratch (NumPy-only)#
We’ll implement:
one-sample t-test
two-sample t-test (pooled-variance and Welch)
paired t-test (as a one-sample test on differences)
To keep everything transparent, we’ll also compute:
p-values via numerical integration of the t density
critical values / confidence intervals via inverting the CDF with bisection
This is for learning. In production, use a dedicated stats library for speed and edge cases.
def t_cdf_numeric(t: float, df: float) -> float:
# Numerical CDF via symmetry + trapezoidal integration (no SciPy).
t = float(t)
df = float(df)
if t == 0.0:
return 0.5
sign = 1.0 if t > 0 else -1.0
t_abs = abs(t)
# Adaptive resolution: more points for larger |t|.
n = int(np.clip(np.ceil(5000 * t_abs), 4000, 200000))
grid = np.linspace(0.0, t_abs, n)
area = np.trapz(t_pdf(grid, df), grid)
return float(0.5 + sign * area)
def t_p_value(t_stat: float, df: float, alternative: str = "two-sided") -> float:
cdf = t_cdf_numeric(t_stat, df)
if alternative == "two-sided":
return float(2 * min(cdf, 1 - cdf))
if alternative == "greater":
return float(1 - cdf)
if alternative == "less":
return float(cdf)
raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
def t_ppf_numeric(p: float, df: float, *, tol: float = 1e-6, max_iter: int = 80) -> float:
# Inverse CDF via bisection (sufficient for CI/critical values).
p = float(p)
df = float(df)
if not (0.0 < p < 1.0):
raise ValueError("p must be in (0, 1)")
if p == 0.5:
return 0.0
# Symmetry: solve on positive side.
sign = 1.0 if p > 0.5 else -1.0
p_target = p if p > 0.5 else 1.0 - p
lo, hi = 0.0, 50.0
while t_cdf_numeric(hi, df) < p_target:
hi *= 2
if hi > 1e6:
raise RuntimeError("Failed to bracket quantile")
for _ in range(max_iter):
mid = 0.5 * (lo + hi)
if t_cdf_numeric(mid, df) < p_target:
lo = mid
else:
hi = mid
if (hi - lo) < tol:
break
return sign * 0.5 * (lo + hi)
def ci_two_sided(estimate: float, se: float, df: float, alpha: float = 0.05) -> tuple[float, float]:
t_crit = t_ppf_numeric(1 - alpha / 2, df)
return (estimate - t_crit * se, estimate + t_crit * se)
def one_sample_t_test(
x: np.ndarray,
*,
mu0: float = 0.0,
alternative: str = "two-sided",
alpha: float = 0.05,
) -> dict:
x = np.asarray(x, dtype=float)
if x.ndim != 1 or x.size < 2:
raise ValueError("x must be a 1D array with at least 2 values")
n = x.size
mean = float(x.mean())
s = float(x.std(ddof=1))
se = s / np.sqrt(n)
estimate = mean - mu0
t_stat = estimate / se
df = float(n - 1)
p = t_p_value(t_stat, df, alternative=alternative)
ci = ci_two_sided(estimate, se, df, alpha=alpha)
return {
"test": "one-sample t-test",
"n": int(n),
"df": df,
"mu0": float(mu0),
"estimate": float(estimate),
"t_stat": float(t_stat),
"p_value": float(p),
"ci": tuple(map(float, ci)),
"reject_at_alpha": bool(p <= alpha),
}
def two_sample_t_test(
x: np.ndarray,
y: np.ndarray,
*,
diff0: float = 0.0,
equal_var: bool = False,
alternative: str = "two-sided",
alpha: float = 0.05,
) -> dict:
x = np.asarray(x, dtype=float)
y = np.asarray(y, dtype=float)
if x.ndim != 1 or y.ndim != 1 or x.size < 2 or y.size < 2:
raise ValueError("x and y must be 1D arrays with at least 2 values each")
n1, n2 = x.size, y.size
m1, m2 = float(x.mean()), float(y.mean())
s1, s2 = float(x.std(ddof=1)), float(y.std(ddof=1))
raw_diff = m1 - m2
estimate = raw_diff - diff0
if equal_var:
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se = float(np.sqrt(sp2 * (1 / n1 + 1 / n2)))
df = float(n1 + n2 - 2)
variant = "two-sample t-test (pooled variance)"
else:
se2 = s1**2 / n1 + s2**2 / n2
se = float(np.sqrt(se2))
df = float(se2**2 / ((s1**2 / n1) ** 2 / (n1 - 1) + (s2**2 / n2) ** 2 / (n2 - 1)))
variant = "two-sample t-test (Welch)"
t_stat = estimate / se
p = t_p_value(t_stat, df, alternative=alternative)
ci = ci_two_sided(estimate, se, df, alpha=alpha)
return {
"test": variant,
"n1": int(n1),
"n2": int(n2),
"df": float(df),
"diff0": float(diff0),
"estimate": float(estimate),
"t_stat": float(t_stat),
"p_value": float(p),
"ci": tuple(map(float, ci)),
"reject_at_alpha": bool(p <= alpha),
"means": (m1, m2),
"stds": (s1, s2),
}
def paired_t_test(
before: np.ndarray,
after: np.ndarray,
*,
diff0: float = 0.0,
alternative: str = "two-sided",
alpha: float = 0.05,
) -> dict:
before = np.asarray(before, dtype=float)
after = np.asarray(after, dtype=float)
if before.shape != after.shape or before.ndim != 1 or before.size < 2:
raise ValueError("before and after must be 1D arrays of the same length (>= 2)")
d = after - before
res = one_sample_t_test(d, mu0=diff0, alternative=alternative, alpha=alpha)
res["test"] = "paired t-test (one-sample on differences)"
res["mean_before"] = float(before.mean())
res["mean_after"] = float(after.mean())
return res
def plot_t_p_value(
t_stat: float,
df: float,
*,
alternative: str = "two-sided",
title: str | None = None,
):
x_max = max(6.0, abs(float(t_stat)) * 1.6 + 1.0)
x = np.linspace(-x_max, x_max, 3000)
y = t_pdf(x, df)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x, y=y, mode="lines", name=f"t(df={df:.2f})"))
fill_color = "rgba(200, 30, 30, 0.25)"
if alternative == "two-sided":
cut = abs(float(t_stat))
masks = [x <= -cut, x >= cut]
for i, mask in enumerate(masks):
fig.add_trace(
go.Scatter(
x=np.concatenate([x[mask], x[mask][::-1]]),
y=np.concatenate([y[mask], np.zeros_like(y[mask])]),
fill="toself",
mode="lines",
line=dict(color="rgba(0,0,0,0)"),
fillcolor=fill_color,
name="p-value area" if i == 0 else None,
showlegend=(i == 0),
)
)
fig.add_vline(x=cut, line_dash="dash", line_color="rgba(200, 30, 30, 0.9)")
fig.add_vline(x=-cut, line_dash="dash", line_color="rgba(200, 30, 30, 0.9)")
elif alternative == "greater":
cut = float(t_stat)
mask = x >= cut
fig.add_trace(
go.Scatter(
x=np.concatenate([x[mask], x[mask][::-1]]),
y=np.concatenate([y[mask], np.zeros_like(y[mask])]),
fill="toself",
mode="lines",
line=dict(color="rgba(0,0,0,0)"),
fillcolor=fill_color,
name="p-value area",
)
)
fig.add_vline(x=cut, line_dash="dash", line_color="rgba(200, 30, 30, 0.9)")
elif alternative == "less":
cut = float(t_stat)
mask = x <= cut
fig.add_trace(
go.Scatter(
x=np.concatenate([x[mask], x[mask][::-1]]),
y=np.concatenate([y[mask], np.zeros_like(y[mask])]),
fill="toself",
mode="lines",
line=dict(color="rgba(0,0,0,0)"),
fillcolor=fill_color,
name="p-value area",
)
)
fig.add_vline(x=cut, line_dash="dash", line_color="rgba(200, 30, 30, 0.9)")
else:
raise ValueError("alternative must be 'two-sided', 'greater', or 'less'")
p = t_p_value(t_stat, df, alternative=alternative)
fig.update_layout(
title=title or f"Tail area = p-value (t={t_stat:.3f}, df={df:.2f}, p≈{p:.4f})",
xaxis_title="t",
yaxis_title="density",
)
return fig
7) Example: one-sample t-test#
Scenario: you measure a metric (say, average task completion time) and want to test whether the mean differs from a baseline \(\mu_0\).
We’ll generate a small synthetic sample and test \(H_0: \mu = \mu_0\).
# Synthetic data: mean slightly above baseline
mu0 = 10.0
n = 18
x = rng.normal(loc=10.8, scale=1.5, size=n)
res_1s = one_sample_t_test(x, mu0=mu0, alternative="two-sided", alpha=0.05)
res_1s
{'test': 'one-sample t-test',
'n': 18,
'df': 17.0,
'mu0': 10.0,
'estimate': 0.6760694827820508,
't_stat': 2.1448991639983137,
'p_value': 0.04670272951904608,
'ci': (0.011058309525378007, 1.3410806560387236),
'reject_at_alpha': True}
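If SciPy happens to be installed, you can cross-check the NumPy-only result against it (entirely optional; nothing below depends on this):
# Optional sanity check against SciPy (skipped if SciPy is not available).
try:
    from scipy import stats
    print(stats.ttest_1samp(x, popmean=mu0))  # statistic/p-value should match res_1s
except ImportError:
    print("SciPy not installed; skipping cross-check")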
fig = plot_t_p_value(res_1s["t_stat"], res_1s["df"], title="One-sample t-test: p-value as tail area")
fig.show()
mean = float(np.mean(x))
ci_low, ci_high = res_1s["ci"]
fig = px.histogram(
x=x,
nbins=12,
title="Raw observations (with baseline and estimated mean)",
labels={"x": "value"},
)
fig.add_vline(x=mu0, line_dash="dash", line_color="black", annotation_text="μ₀")
fig.add_vline(x=mean, line_dash="solid", line_color="rgba(200,30,30,0.9)", annotation_text="x̄")
fig2 = go.Figure()
fig2.add_trace(
go.Scatter(
x=[res_1s["estimate"]],
y=["mean − μ₀"],
mode="markers",
marker=dict(size=10),
error_x=dict(
type="data",
symmetric=False,
array=[ci_high - res_1s["estimate"]],
arrayminus=[res_1s["estimate"] - ci_low],
),
)
)
fig2.add_vline(x=0.0, line_dash="dash", line_color="black")
fig2.update_layout(
title="Estimated mean difference with 95% CI",
xaxis_title="difference",
yaxis_title="",
)
fig.show()
fig2.show()
Interpreting the one-sample result#
If p_value <= α, you reject \(H_0\) at level \(\alpha\). The CI tells you a range of plausible mean differences (mean − \(\mu_0\)).
A useful equivalence (two-sided tests):
Rejecting \(H_0: \mu=\mu_0\) at level \(\alpha\) is the same as the \((1-\alpha)\) CI for \(\mu\) not containing \(\mu_0\).
8) Example: two-sample t-test (Welch vs pooled)#
Two independent groups (A and B). We want to test \(H_0: \mu_A = \mu_B\), i.e. that the difference in group means is zero.
Welch vs pooled-variance (Student) t-test#
Welch (recommended default): does not assume equal variances.
Pooled variance: assumes the two populations have the same variance.
If you’re unsure, use Welch.
# Synthetic A/B data with different variances (Welch is appropriate)
n_a, n_b = 24, 20
A = rng.normal(loc=0.0, scale=1.0, size=n_a)
B = rng.normal(loc=0.55, scale=1.5, size=n_b)
res_welch = two_sample_t_test(A, B, equal_var=False)
res_pooled = two_sample_t_test(A, B, equal_var=True)
res_welch, res_pooled
({'test': 'two-sample t-test (Welch)',
'n1': 24,
'n2': 20,
'df': 30.897981559364187,
'diff0': 0.0,
'estimate': -0.3018367277038695,
't_stat': -0.97827544805771,
'p_value': 0.33553605768968287,
'ci': (-0.9311915269948647, 0.32751807158712576),
'reject_at_alpha': False,
'means': (0.1793406145972717, 0.48117734230114123),
'stds': (0.7546575962141199, 1.1955516622986013)},
{'test': 'two-sample t-test (pooled variance)',
'n1': 24,
'n2': 20,
'df': 42.0,
'diff0': 0.0,
'estimate': -0.3018367277038695,
't_stat': -1.0182971425705634,
'p_value': 0.3143669449743095,
'ci': (-0.9000228288264907, 0.29634937341875167),
'reject_at_alpha': False,
'means': (0.1793406145972717, 0.48117734230114123),
'stds': (0.7546575962141199, 1.1955516622986013)})
values = np.concatenate([A, B])
groups = np.array(["A"] * A.size + ["B"] * B.size)
fig = px.violin(
x=groups,
y=values,
box=True,
points="all",
title="Two groups: distribution view (A vs B)",
labels={"x": "group", "y": "value"},
)
fig.show()
ci_low, ci_high = res_welch["ci"]
fig2 = go.Figure()
fig2.add_trace(
go.Scatter(
x=[res_welch["estimate"]],
y=["A − B"],
mode="markers",
marker=dict(size=10),
error_x=dict(
type="data",
symmetric=False,
array=[ci_high - res_welch["estimate"]],
arrayminus=[res_welch["estimate"] - ci_low],
),
)
)
fig2.add_vline(x=0.0, line_dash="dash", line_color="black")
fig2.update_layout(
title="Estimated mean difference (Welch) with 95% CI",
xaxis_title="difference",
yaxis_title="",
)
fig2.show()
fig = plot_t_p_value(res_welch["t_stat"], res_welch["df"], title="Welch t-test: p-value as tail area")
fig.show()
9) Example: paired t-test#
Paired data means each “before” is linked to an “after” for the same unit (person, device, country, …).
The trick: compute the difference per pair and run a one-sample t-test on those differences.
Then test \(H_0: \mu_d = 0\).
# Synthetic paired data: the same units measured twice
n = 18
before = rng.normal(loc=50, scale=8, size=n)
# after is correlated with before + an average improvement
after = before + rng.normal(loc=2.0, scale=3.0, size=n)
res_paired = paired_t_test(before, after)
res_paired
{'test': 'paired t-test (one-sample on differences)',
'n': 18,
'df': 17.0,
'mu0': 0.0,
'estimate': 0.903508293293421,
't_stat': 1.7405904074002199,
'p_value': 0.0998231022246423,
'ci': (-0.19165794524122026, 1.9986745318280623),
'reject_at_alpha': False,
'mean_before': 50.07036480566932,
'mean_after': 50.97387309896274}
fig = go.Figure()
for i in range(n):
fig.add_trace(
go.Scatter(
x=["before", "after"],
y=[before[i], after[i]],
mode="lines+markers",
line=dict(color="rgba(0,0,0,0.25)"),
marker=dict(size=6),
showlegend=False,
)
)
fig.update_layout(
title="Paired measurements: each line is one unit",
xaxis_title="",
yaxis_title="value",
)
fig.show()
diffs = after - before
fig2 = px.histogram(
x=diffs,
nbins=12,
title="Paired differences (after − before)",
labels={"x": "difference"},
)
fig2.add_vline(x=0.0, line_dash="dash", line_color="black")
fig2.show()
fig = plot_t_p_value(res_paired["t_stat"], res_paired["df"], title="Paired t-test: p-value as tail area")
fig.show()
10) Interpreting the output (t, df, p, CI)#
t statistic#
large \(|t|\) → estimate is many standard errors away from the null
degrees of freedom (df)#
df controls tail heaviness. Smaller df means more uncertainty because you estimated more from limited data.
p-value#
two-sided: “How likely is a value with \(|T| \ge |t_{obs}|\) under \(H_0\)?”
one-sided: tail probability only in the direction you care about (quick numeric check below)
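Using the helpers defined earlier, you can verify the relationship directly: for a positive observed t, the two-sided p-value is twice the “greater” tail (up to small numerical-integration error).
# One-sided vs two-sided p-values for the one-sample result from section 7.
t_obs, df_obs = res_1s["t_stat"], res_1s["df"]
(
    t_p_value(t_obs, df_obs, alternative="two-sided"),
    t_p_value(t_obs, df_obs, alternative="greater"),
    t_p_value(t_obs, df_obs, alternative="less"),
)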
confidence interval#
For two-sided tests, the \((1-\alpha)\) CI and the test are consistent:
if the CI excludes 0 (or \(\mu_0\)), you reject at level \(\alpha\)
if the CI includes 0 (or \(\mu_0\)), you don’t reject
practical vs statistical significance#
A tiny effect can be statistically significant with a large sample; a meaningful effect can be non-significant with a small sample. Always look at the effect size + CI.
alpha = 0.05
def ci_excludes_zero(ci: tuple[float, float]) -> bool:
lo, hi = ci
return (lo > 0.0) or (hi < 0.0)
checks = {
"one_sample": {
"p<=alpha": res_1s["p_value"] <= alpha,
"ci_excludes_0": ci_excludes_zero(res_1s["ci"]),
},
"welch": {
"p<=alpha": res_welch["p_value"] <= alpha,
"ci_excludes_0": ci_excludes_zero(res_welch["ci"]),
},
"paired": {
"p<=alpha": res_paired["p_value"] <= alpha,
"ci_excludes_0": ci_excludes_zero(res_paired["ci"]),
},
}
checks
{'one_sample': {'p<=alpha': True, 'ci_excludes_0': True},
'welch': {'p<=alpha': False, 'ci_excludes_0': False},
'paired': {'p<=alpha': False, 'ci_excludes_0': False}}
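To make the practical-vs-statistical distinction concrete, here is a small sketch (it uses its own generator, rng_demo, so the seeded examples above are unchanged): with a huge sample, a practically negligible shift still tends to produce a tiny p-value.
# A tiny true effect (+0.01) with a huge sample: typically "significant",
# yet the estimated difference itself is negligible in practice.
rng_demo = np.random.default_rng(0)
big = rng_demo.normal(loc=0.01, scale=1.0, size=200_000)
res_big = one_sample_t_test(big, mu0=0.0)
res_big["p_value"], res_big["estimate"], res_big["ci"]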
# Under a true null, p-values should be ~Uniform(0,1).
# We'll simulate many one-sample t-tests where H0 is true.
n = 12
n_sim = 400
mu0 = 0.0
pvals = []
for _ in range(n_sim):
sample = rng.normal(loc=0.0, scale=1.0, size=n)
mean = float(sample.mean())
s = float(sample.std(ddof=1))
se = s / np.sqrt(n)
t_stat = (mean - mu0) / se
pvals.append(t_p_value(t_stat, n - 1, alternative="two-sided"))
fig = px.histogram(
x=pvals,
nbins=20,
title="p-values under a true null are (approximately) uniform",
labels={"x": "p-value"},
)
fig.add_hline(y=n_sim / 20, line_dash="dash", line_color="black", annotation_text="uniform baseline")
fig.show()
11) Assumptions, diagnostics, and pitfalls#
Assumptions (typical)#
Independence of observations (or of paired differences)
The sampling distribution of the mean is close enough to normal:
data are approximately normal, or
sample size is large enough for the CLT to kick in (a quick visual check follows below)
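A simple visual diagnostic is a normal QQ-plot of the data (or of the paired differences): a roughly straight line suggests approximate normality. A minimal sketch using only the standard library's statistics.NormalDist for quantiles (the qq_plot helper is illustrative):
# Normal QQ-plot sketch: sorted data vs theoretical normal quantiles (no SciPy).
from statistics import NormalDist

def qq_plot(data: np.ndarray, title: str = "Normal QQ-plot") -> go.Figure:
    data = np.sort(np.asarray(data, dtype=float))
    n_pts = data.size
    probs = (np.arange(1, n_pts + 1) - 0.5) / n_pts   # plotting positions
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=theo, y=data, mode="markers", name="quantiles"))
    # Reference line through the first and third quartiles
    q1, q3 = np.percentile(data, [25, 75])
    z1, z3 = NormalDist().inv_cdf(0.25), NormalDist().inv_cdf(0.75)
    slope = (q3 - q1) / (z3 - z1)
    line_y = q1 + slope * (theo - z1)
    fig.add_trace(go.Scatter(x=theo, y=line_y, mode="lines", name="reference"))
    fig.update_layout(title=title, xaxis_title="theoretical quantile", yaxis_title="sample quantile")
    return fig

qq_plot(after - before, title="Paired differences: normal QQ-plot").show()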
Common pitfalls#
choosing one-sided vs two-sided after seeing the data
stopping data collection as soon as p < 0.05 (optional stopping) without correction
multiple testing without adjustment (p-hacking)
ignoring effect size and uncertainty (CI)
Alternatives when assumptions fail#
Permutation test on the mean difference (very interpretable)
Bootstrap CI for the mean / mean difference
Wilcoxon signed-rank (paired) or Mann–Whitney U (two-sample) for ordinal / heavy-tailed cases
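As a concrete fallback, here is a minimal bootstrap percentile CI for the two-sample mean difference (a sketch; bootstrap_diff_ci is illustrative and uses its own generator, so earlier seeded results are unaffected):
# Bootstrap percentile CI for mean(A) - mean(B): resample each group with
# replacement, recompute the difference, and take empirical quantiles.
def bootstrap_diff_ci(x, y, n_boot: int = 5000, alpha: float = 0.05, seed: int = 0):
    rng_b = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        diffs[i] = rng_b.choice(x, size=x.size).mean() - rng_b.choice(y, size=y.size).mean()
    return tuple(np.quantile(diffs, [alpha / 2, 1 - alpha / 2]))

bootstrap_diff_ci(A, B)  # compare with the Welch CI in res_welch["ci"]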
# Insight: as n grows, the same effect becomes easier to detect.
alpha = 0.05
true_effect = 0.3 # mean shift in SD units
n_grid = np.array([5, 8, 12, 20, 30, 50, 80])
n_sim = 3000
powers = []
false_pos = []
for n in n_grid:
df = n - 1
t_crit = t_ppf_numeric(1 - alpha / 2, df)
# Under H0 (no effect)
x0 = rng.normal(loc=0.0, scale=1.0, size=(n_sim, n))
m0 = x0.mean(axis=1)
s0 = x0.std(axis=1, ddof=1)
t0 = m0 / (s0 / np.sqrt(n))
false_pos.append(float(np.mean(np.abs(t0) >= t_crit)))
# Under H1 (effect)
x1 = rng.normal(loc=true_effect, scale=1.0, size=(n_sim, n))
m1 = x1.mean(axis=1)
s1 = x1.std(axis=1, ddof=1)
t1 = m1 / (s1 / np.sqrt(n))
powers.append(float(np.mean(np.abs(t1) >= t_crit)))
fig = go.Figure()
fig.add_trace(go.Scatter(x=n_grid, y=false_pos, mode="lines+markers", name="Type I error (H0)"))
fig.add_trace(go.Scatter(x=n_grid, y=powers, mode="lines+markers", name=f"Power (effect={true_effect})"))
fig.add_hline(y=alpha, line_dash="dash", line_color="black", annotation_text="α")
fig.update_layout(
title="Sample size vs false-positive rate and power (two-sided t-test)",
xaxis_title="sample size n",
yaxis_title="probability",
yaxis=dict(range=[0, 1]),
)
fig.show()
12) Exercises#
Implement a permutation test for the two-sample mean difference and compare its p-value to Welch’s t-test.
Create data with one huge outlier and see how the t-test changes. Try a trimmed mean instead.
For the paired case, simulate scenarios where pairing helps a lot vs. barely at all, by varying the correlation between before and after.
References#
Student (1908), “The probable error of a mean”, Biometrika 6(1), 1–25 (the original t-test paper)
Any intro stats text: sections on Student’s t distribution and t-tests